fast-interp: relaxed-SIMD opcode lowering#4950
Open
matthargett wants to merge 9 commits into
…-SIMD
The relaxed-SIMD proposal — finalized as a wasm 2.0 extension — uses
the same 0xfd SIMD prefix and reserves sub-opcodes `0x100..0x113`
for its 20 new ops:
relaxed_swizzle, relaxed_trunc_f32x4_{s,u}, relaxed_trunc_f64x2_{s,u}_zero,
relaxed_madd / relaxed_nmadd for f32x4 + f64x2,
relaxed_laneselect for i8 / i16 / i32 / i64,
relaxed_min / relaxed_max for f32x4 + f64x2,
relaxed_q15mulr_s,
relaxed_dot_i8x16_i7x16_{s,add_s}.
This commit lays down the loader-side validation needed to *recognize*
these opcodes without changing dispatch / runtime behaviour:
* `WASMSimdEXTOpcode` enum (wasm_opcode.h) extended with the 20
new constants at the spec-assigned values 0x100..0x113. Gated
behind `WASM_ENABLE_RELAXED_SIMD != 0` so a build without the
cmake flag (added in a follow-up commit) sees no new symbols
and the enum's storage is unchanged.
* `wasm_loader_find_block_addr` SIMD-prefix immediate skipper
(wasm_loader.c:8273-8363): the inner switch is now on the
raw LEB-uint32 sub-opcode instead of the `(uint8)` cast, so
relaxed-SIMD sub-opcodes 0x100..0x113 reach their own case
labels instead of aliasing into legacy slots 0x00..0x13 and
triggering wrong `skip_leb_*` paths. Relaxed-SIMD opcodes
carry no immediates beyond the prefix, so the new cases just
`break`. They are listed explicitly so that a future SIMD-spec
assignment in 0x100..0x113 doesn't fall through the default
branch and silently mis-skip an immediate. The cast assignment
to the outer `opcode` variable is removed since the inner
switch no longer uses it (the outer-function switch has already
matched `WASM_OP_SIMD_PREFIX` and this code is inside that case).
* `wasm_loader_prepare_bytecode` SIMD-prefix type checker
(wasm_loader.c:16186+) — extended with type-signature case
labels for each relaxed-SIMD opcode. Three signature classes:
unary (1 v128 -> 1 v128): the four trunc variants.
binary (2 v128 -> 1 v128): swizzle, min/max, q15mulr,
dot_i8x16_i7x16_s.
ternary(3 v128 -> 1 v128): madd, nmadd, laneselect,
dot_i8x16_i7x16_add_s.
The 3-input ternary shape uses `POP_V128()` + `POP2_AND_PUSH`,
mirroring how `SIMD_v128_bitselect` handles its 3-input shape
today — no new stack-tracker macro needed.
* The trailing `default:` branch in the type checker keeps
rejecting unrecognized SIMD sub-opcodes with
`"invalid opcode 0xfd %02x."`, which now correctly surfaces
the full uint32 value (relaxed-SIMD opcodes 0x100+ are
rendered as e.g. `0xfd 100` — the `%02x` width is a minimum,
not a truncation).
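The aliasing and the diagnostic formatting described in the bullets above can be reproduced in isolation. The following is a minimal sketch, not WAMR's loader code: `read_leb_u32` and `format_invalid_opcode` are hypothetical helper names, and only the format string is taken from the commit message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Decode an unsigned LEB128 value (at most 5 bytes for a uint32).
 * Hypothetical helper, not the WAMR read_leb implementation. */
static uint32_t read_leb_u32(const uint8_t *p, const uint8_t **end)
{
    uint32_t result = 0;
    unsigned shift = 0;

    assert(p != NULL);
    for (;;) {
        uint8_t byte = *p++;
        result |= (uint32_t)(byte & 0x7f) << shift;
        if ((byte & 0x80) == 0)
            break;
        shift += 7;
        assert(shift < 35); /* malformed-input guard for this sketch */
    }
    if (end)
        *end = p;
    return result;
}

/* Format the loader-style diagnostic. "%02x" pads to at least two hex
 * digits but never truncates, so 0x100 is printed as "100". */
static int format_invalid_opcode(char *buf, size_t len, uint32_t sub_opcode)
{
    return snprintf(buf, len, "invalid opcode 0xfd %02x.",
                    (unsigned)sub_opcode);
}
```

Decoding the two bytes `0x80 0x02` yields 0x100 (`relaxed_swizzle`), and `(uint8_t)0x100` is 0x00, which is exactly the aliasing into the legacy slot that switching on the raw uint32 avoids.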
The runtime executor (the actual case bodies in
`HANDLE_OP(WASM_OP_SIMD_PREFIX)` and the IR encoder widening
needed to reach them past the existing 1-byte sub-opcode read)
is the follow-up commit. The cmake `WAMR_BUILD_RELAXED_SIMD` flag
that flips `WASM_ENABLE_RELAXED_SIMD=1` is the third commit.
Built clean against `cd390ea0` with the flag absent: no
binary or behavioural change to existing SIMD code.
References:
https://github.com/WebAssembly/relaxed-simd/blob/main/proposals/relaxed-simd/Overview.md
https://github.com/WebAssembly/relaxed-simd/blob/main/proposals/relaxed-simd/_md/instructions.md
The 20 relaxed-SIMD ops (`0x100..0x113`) land as new case bodies
inside the existing `HANDLE_OP(WASM_OP_SIMD_PREFIX)` switch in
`wasm_interp_fast.c`. Each case follows the legacy SIMD-case
shape: pop the v128 operand(s) from `frame_lp`, hand them to a
SIMDe intrinsic (or a hand lane loop for the three SIMDe-missing
ops), write one v128 result.
To reach a case past 0xff the SIMD sub-opcode is widened from a
single byte to a little-endian uint16 in the IR. The loader emits
two consecutive bytes via `wasm_loader_emit_int16` (STORE_U16, no
padding even on platforms without unaligned access). The runtime
reads `frame_ip[0] | (frame_ip[1] << 8)` and switches over the
full `0x000..0x113` range. The widening is conditional on
`WASM_ENABLE_RELAXED_SIMD != 0`; when off, the IR is still
1-byte-per-SIMD-op via `emit_byte` and the runtime dispatch is
the legacy `GET_OPCODE()` 1-byte read — byte-identical to
upstream.
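The two-byte encoding can be sketched on its own. This is an illustration under assumptions: `emit_sub_opcode_u16` and `read_sub_opcode_u16` are hypothetical names standing in for the loader's `wasm_loader_emit_int16` path and the interpreter's inline read; only the byte layout (`frame_ip[0] | (frame_ip[1] << 8)`) comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Write a SIMD sub-opcode as two little-endian bytes and return the
 * advanced write cursor. Byte-wise stores need no alignment. */
static uint8_t *emit_sub_opcode_u16(uint8_t *ip, uint32_t sub_opcode)
{
    assert(sub_opcode <= 0xffff);
    ip[0] = (uint8_t)(sub_opcode & 0xff);
    ip[1] = (uint8_t)(sub_opcode >> 8);
    return ip + 2;
}

/* Read the sub-opcode back and advance the instruction pointer by 2. */
static uint32_t read_sub_opcode_u16(const uint8_t **frame_ip)
{
    const uint8_t *ip = *frame_ip;
    uint32_t sub = (uint32_t)ip[0] | ((uint32_t)ip[1] << 8);

    *frame_ip = ip + 2;
    return sub;
}
```

Reading byte by byte keeps the decode independent of host endianness and alignment, at the cost of one extra load compared with the legacy single-byte dispatch.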
Per-case dispatch:
i8x16.relaxed_swizzle                         (binary)     DOUBLE
i32x4.relaxed_trunc_f32x4_{s,u},
i32x4.relaxed_trunc_f64x2_{s,u}_zero          (4 unary)    SINGLE
{f32x4,f64x2}.relaxed_{madd,nmadd}            (4 ternary)  TRIPLE
{i8x16,i16x8,i32x4,i64x2}.relaxed_laneselect  (4 ternary)  TRIPLE
{f32x4,f64x2}.relaxed_{min,max}               (4 binary)   DOUBLE
i16x8.relaxed_q15mulr_s                       (binary)     hand loop
i16x8.relaxed_dot_i8x16_i7x16_s               (binary)     hand loop
i32x4.relaxed_dot_i8x16_i7x16_add_s           (ternary)    hand loop
SIMDe's `simde/wasm/relaxed-simd.h` (already shipped in
`core/deps/simde`) provides 17 of the 20 intrinsics; q15mulr_s,
dot_i8x16_i7x16_s, and dot_i8x16_i7x16_add_s are missing so the
dispatch loop inlines a per-lane C implementation. The relaxed-
SIMD spec allows implementation-defined behavior on overflow for
those three (wrap vs. saturate); the impls here match the
strict-IEEE / saturating shape — same as the corresponding
non-relaxed ops — which is conformant and matches the SIMDe
hand-coded fallbacks for q15mulr_sat_s.
A new local `SIMD_TRIPLE_OP(simde_func)` macro pops 3 v128s and
hands them to a 3-arg intrinsic; same shape as `SIMD_DOUBLE_OP` /
`SIMD_SINGLE_OP` for two- and one-arg ops. `#undef`-ed at the end
of the gated block so the macro doesn't leak into the legacy
build.
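The pop-three, push-one shape can be sketched with a simplified operand stack. Everything here is a stand-in: WAMR's real macro works on `frame_lp` slots and SIMDe types, whereas `V128`, `OperandStack`, `SKETCH_TRIPLE_OP` and `relaxed_madd_f32x4` below are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a 128-bit vector value viewed as four f32 lanes. */
typedef struct {
    float f32x4[4];
} V128;

/* Simplified operand stack; the fast interpreter uses frame_lp instead. */
typedef struct {
    V128 slots[8];
    int sp;
} OperandStack;

/* Lane-wise a * b + c, written as separate multiply and add. A compiler
 * may or may not contract this into a fused multiply-add. */
static V128 relaxed_madd_f32x4(V128 a, V128 b, V128 c)
{
    V128 r;

    for (int i = 0; i < 4; i++)
        r.f32x4[i] = a.f32x4[i] * b.f32x4[i] + c.f32x4[i];
    return r;
}

/* Pop three operands (last pushed is the third argument), call a
 * 3-argument function, push the single result. */
#define SKETCH_TRIPLE_OP(stk, func)                      \
    do {                                                 \
        assert((stk)->sp >= 3);                          \
        V128 c_ = (stk)->slots[--(stk)->sp];             \
        V128 b_ = (stk)->slots[--(stk)->sp];             \
        V128 a_ = (stk)->slots[--(stk)->sp];             \
        (stk)->slots[(stk)->sp++] = func(a_, b_, c_);    \
    } while (0)
```

Wrapping the body in `do { ... } while (0)` keeps the macro usable as a single statement inside a `case` arm, which is the usual reason dispatch macros of this kind are written that way.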
Smoke tested via a 6-op WAT module (swizzle, madd, min,
laneselect, q15mulr_s, trunc_f32x4_s) compiled to wasm and run
through the `iwasm` CLI with `WAMR_BUILD_RELAXED_SIMD=1`:
madd = [110, 240, 390, 560] ✓
trunc_f32 = [1, -2, 3, -4] ✓
min = [1, 2, 2, 1] ✓
q15mulr = [0,0,1,1,3,4,6,-7] ✓
swizzle = [15..0] (reverse) ✓
laneselect = (bitwise a/b mux per mask) ✓
The `wasm_loader_prepare_bytecode` SIMD switch type checker
(commit 1) is already populated for the new opcodes, so the
relaxed-SIMD wasm validates through the loader and then reaches
the new dispatch cases here. The cmake flag that exposes the
feature (`WAMR_BUILD_RELAXED_SIMD`) is the next commit; this one
adds the runtime side gated on the eventual macro.
Lights up the dormant `WASM_FEATURE_RELAXED_SIMD` bit at
`aot_runtime.h:32` for the fast interpreter. Default `0` so a
build that doesn't explicitly opt in stays byte-identical to
upstream — the loader + dispatch added in the two prior commits
all sit behind `#if WASM_ENABLE_RELAXED_SIMD != 0`.
* `WAMR_BUILD_RELAXED_SIMD=1` adds `-DWASM_ENABLE_RELAXED_SIMD=1`
to the C compile line and prints `"Relaxed SIMD enabled"` next
to the existing `"SIMD enabled"` line.
* `WAMR_BUILD_RELAXED_SIMD=1 WAMR_BUILD_SIMD=0` fails fast with
`FATAL_ERROR "WAMR_BUILD_RELAXED_SIMD=1 requires
WAMR_BUILD_SIMD=1"`. Relaxed-SIMD is a superset of the base
feature — the dispatch loop, frame_lp v128 cells, and SIMDe
intrinsics it shares with legacy SIMD would all be compiled
out otherwise.
* Listed in the existing "feature summary" block alongside
`"Fixed-width SIMD"` so `WAMR_INFO` output makes the new
knob visible.
Verified locally on macOS-15 / aarch64:
flag=0 (default):
iwasm -f madd /tmp/relaxed_smoke.wasm
-> WASM module load failed: invalid opcode 0xfd 100.
flag=1:
iwasm -f madd /tmp/relaxed_smoke.wasm
-> <0x4370000042dc0000 0x440c000043c30000>:v128
(correct f32x4 result for relaxed_madd)
flag=1 simd=0:
cmake -> "WAMR_BUILD_RELAXED_SIMD=1 requires WAMR_BUILD_SIMD=1"
(configure aborts)
The two macros `SIMD_V128_TO_SIMDE_V128` and `SIMDE_V128_TO_SIMD_V128`
copy 16-byte values between WAMR's `V128` union-of-arrays and
SIMDe's compiler-intrinsic vector type (`int32x4_t` on aarch64,
`__m128i` on x86-64) at every SIMD case boundary. The previous
shape used `bh_memcpy_s`, which lives out-of-line in
`core/shared/utils/bh_common.c`. Without LTO the call doesn't
inline, so every conversion compiled into a real `bl` instruction:
three function calls on 3-operand SIMD ops (madd / nmadd /
laneselect / bitselect / dot_add) plus one on the store, for ~4
function calls per SIMD dispatch.
xctrace CPU Counters on the aarch64 M4 E-core, matmul-fma
workload (the relaxed-SIMD f32x4_relaxed_madd hot loop):
              before   after
Useful         78.1%   71.4%
Processing      6.1%   23.3%
Delivery       13.4%    2.9%  <- frontend stalls, the bottleneck
Discarded       2.4%    2.5%
total cycles    301M    733M  (over 5s vs 10.9s, more iters)
The 13.4% `Delivery` share — frontend / L1-I stall — vanished:
the SIMD-prefix region's case bodies were big enough (~50
instructions per relaxed_madd dispatch, dominated by `bl
memcpy_chk` chains and intermediate v128 spills) to push the
SIMD switch out of L1-I on the E-core. After the fix each case
body is ~15 instructions, all register-resident, no calls.
Per-case disassembly (`f32x4_relaxed_madd`):
before                            after
~50 instructions                  ~15 instructions
3x bl memcpy_chk                  0 calls
4x v128 stack-spill load/store    3 frame_lp loads,
                                  1 frame_lp store,
                                  1 fmla.4s
`wasm_interp_call_func_bytecode` total instruction count drops
from 14,560 -> 8,735 (40% smaller, comfortably inside the
Icestorm 128 KiB L1-I budget alongside hot non-SIMD ops).
End-to-end wallclock on M4 E-core (`cargo run --release --bin
bench_relaxed_simd`):
matmul simd128 (mul+add)
WAMR before: 1.490 ms median
WAMR after: 0.468 ms median (3.2x speedup)
Pulley: 1.217 ms median
matmul relaxed-simd (FMA)
WAMR before: 1.180 ms median
WAMR after: 0.369 ms median (3.2x speedup)
Pulley: 0.921 ms median
WAMR now leads Pulley on both shapes (2.60x faster on
matmul-simd128, 2.50x faster on matmul-fma, from the medians
above), and WasmEdge interp by 6-7x. The fix applies to *all*
SIMD ops, not just the relaxed-SIMD ones: the macros are on the
hot path for every f32x4 / i32x4 / v128.load / v128.store in the
fast interpreter.
Correctness: `_Static_assert` upgrades the `bh_assert`
size-equality guard from runtime to compile-time so a future
divergence between V128 and simde_v128_t trips the build
rather than silently miscompiling. Semantically identical to
the pre-fix `bh_memcpy_s` for these fixed-size copies.
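A compile-time size guard combined with an inlinable fixed-size copy can be sketched as follows. This is an assumption-laden illustration rather than the patch itself: `vec128_t` is a hypothetical stand-in for `simde_v128_t`, the conversions are written as `static inline` functions instead of the statement-expression macros the PR uses, and `static_assert` is the C11 `<assert.h>` spelling of `_Static_assert`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for WAMR's V128 union-of-arrays. */
typedef union {
    int8_t i8x16[16];
    int32_t i32x4[4];
    float f32x4[4];
} V128;

/* Hypothetical stand-in for SIMDe's vector type. */
typedef struct {
    int32_t lanes[4];
} vec128_t;

/* Fails the build, instead of a runtime assert, if the two
 * representations ever diverge in size. */
static_assert(sizeof(V128) == sizeof(vec128_t),
              "V128 and vec128_t must have the same size");

/* A memcpy with a constant size is typically lowered by optimizing
 * compilers to plain register moves, with no call emitted. */
static inline vec128_t v128_to_vec(V128 v)
{
    vec128_t out;

    memcpy(&out, &v, sizeof out);
    return out;
}

static inline V128 vec_to_v128(vec128_t v)
{
    V128 out;

    memcpy(&out, &v, sizeof out);
    return out;
}
```

Because the copy size is a compile-time constant and the helper is visible in the same translation unit, the optimizer does not need LTO to remove the call, which is the property the out-of-line `bh_memcpy_s` lacked.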
…ts/unit
Anticipates and addresses common WAMR maintainer review feedback on
the relaxed-SIMD PR (#3):
* **HIGH: silent AOT mis-compile when RELAXED_SIMD=1 AOT=1**: the
  shared loader `prepare_bytecode` (`wasm_loader.c`) is reached by
  AOT/JIT/wamrc too. With this PR's commit 1 it accepts the new
  sub-opcodes 0x100..0x113, but the AOT path in
  `core/iwasm/compilation/aot_compiler.c:1494,2463,2639,2799` does
  `opcode = (uint8)opcode1`, silently aliasing `relaxed_swizzle`
  (0x100) into `SIMD_v128_load` (0x00) and reading a garbage memarg
  at codegen time. Reject the combination at cmake-configure time:
  `WAMR_BUILD_RELAXED_SIMD=1` now requires `WAMR_BUILD_FAST_INTERP=1`
  and explicitly rejects `WAMR_BUILD_AOT=1 / WAMR_BUILD_JIT=1 /
  WAMR_BUILD_FAST_JIT=1 / WAMR_BUILD_WAMR_COMPILER=1` with a
  diagnostic that points at `aot_compiler.c` and says "build
  fast-interp-only to use relaxed-SIMD until the AOT/JIT pipelines
  learn the wider sub-opcode range."
* **`core/config.h` default for `WASM_ENABLE_RELAXED_SIMD`**:
  `#ifndef … #define … 0 #endif` block alongside `WASM_ENABLE_SIMD`
  and `WASM_ENABLE_SIMDE`. Cosmetic but matches WAMR's pattern for
  every other feature flag: non-cmake builds (e.g. CI lint that
  compiles a TU in isolation) still see a defined value.
* **`tests/unit/relaxed-simd/`**: gtest-based unit test that loads +
  invokes a hand-encoded wasm module with `f32x4.relaxed_madd`. Two
  tests:
  - `load_module_with_relaxed_madd`: asserts the loader accepts the
    module (pre-PR, this fails with `"invalid opcode 0xfd 100"`).
  - `invoke_relaxed_madd_returns_fma_result`: invokes the export,
    asserts the bit pattern of two f32 lanes (`0x42DC0000` = 110.0
    and `0x43700000` = 240.0). Both single-rounded FMA hardware and
    split mul+add produce the same result here since every
    input/product/sum is exactly representable in f32.
  Wired into `tests/unit/CMakeLists.txt` next to the parallel
  `exception-handling` test target. Gated on
  `WAMR_BUILD_RELAXED_SIMD=1 + WAMR_BUILD_FAST_INTERP=1`.
* **Hand-rolled `q15mulr_s` swap → SIMDe intrinsic**: the patch-2
  case body for `SIMD_i16x8_relaxed_q15mulr_s` previously had a
  lane-by-lane fallback loop (because SIMDe doesn't ship a
  `relaxed_q15mulr_s` intrinsic). SIMDe DOES ship the non-relaxed
  `simde_wasm_i16x8_q15mulr_sat` (strict-saturating `sqrdmulh.8h`
  on aarch64), and the relaxed spec explicitly permits saturating
  behaviour. Swap to that: smaller code, NEON hardware path,
  bit-identical to the hand loop on the INT16_MIN² overflow boundary
  (verified locally via `q15mulr_overflow` test case: both produce
  0x7ffe7fff7fff).
* Docs nit: comment in patch-2 `HANDLE_OP(WASM_OP_SIMD_PREFIX)`
  referenced `emit_uint16(opcode1)` but the actual call is
  `wasm_loader_emit_int16(opcode1)`. Fixed.
Audit items verified OK without code change:
- `clang-format-14` clean across all 5 commits.
- `-Wpedantic` not enabled in `build-scripts/warnings.cmake` so the
  `({ })` GCC statement-expression in the V128 conversion macros is
  fine.
- IR encoding's 2-byte sub-opcode advance via
  `wasm_loader_emit_int16` is safe on platforms without unaligned
  access (STORE_U16 with alignment asserts; legacy `emit_byte` also
  consumed 2 bytes there via padding).
- `WASM_ENABLE_SIMDE` is always set when SIMD+FAST_INTERP are set,
  so the nested `#include "simde/wasm/relaxed-simd.h"` can't be
  reached without SIMDe being in scope.
- `AOT_CURRENT_VERSION` correctly not bumped (no AOT struct
  changed).
References: WAMR PR bytecodealliance#4713 (woodsmc) made tests
mandatory in CONTRIBUTING.md; `@lum1n0us`'s PR
bytecodealliance#4837 review pattern on fast-interp EH ("follow
`tests/unit/interpreter`") shapes the new `tests/unit/relaxed-simd/`
layout. CODEOWNERS will route review to `@loganek @lum1n0us @no1wudi
@TianlongLiang @yamt`.
…diate
Reviewer note (chatgpt-codex-connector on #3): summing all four i8
byte products directly into the i32 lane skipped the i16 truncation
point that the spec defines via i16x8.relaxed_dot +
extadd_pairwise_i16x8_s. For lanes with a=b=0x80, the previous impl
produced 65536+c, which is outside the spec-allowed result set
{-65536+c, 65534+c, -1+c} (wrap or saturate at each of two pair
sums). Fix preserves the i16 intermediate using wrap, matching the
i16x8 dot case immediately above.
Worked example, a=b=0x80 in all four lanes:
    lo_pair = (-128*-128) + (-128*-128) = 32768
    (int16)32768 = -32768 (wrap)
    hi_pair = 32768 → -32768
    ext_sum = (i32)-32768 + (i32)-32768 = -65536
    result = -65536 + c ✓ wrap+wrap allowed value
Two new tests for the chatgpt-codex-connector finding on #3:
1. `dot_add_i16_intermediate_overflow_regression`: pins the
   spec-conformant -65536 result for the input pattern that used to
   produce 65536 (outside the spec-allowed set {-65536, -1, 65534}).
   Future refactor back to a direct-i32-sum impl fails immediately.
2. `dot_s_i16_overflow_pin_sibling_op`: pins the sibling
   `i16x8.relaxed_dot_i8x16_i7x16_s` impl at the same overflow
   boundary. The current impl correctly truncates via the
   `(int16)sum` cast (wasm_interp_fast.c:8103); the test makes a
   future refactor that drops the cast loudly fail.
Both inputs use a = b = 0x80 in all 16 bytes: the classic case where
the i8×i8 pair sum overflows i16 and the truncation point between
"i16x8 relaxed dot" and "extadd_pairwise_i16x8_s" distinguishes
spec-conformant impls from naive direct-sum impls.
Bytecode for both modules was generated via
`wat2wasm --enable-relaxed-simd` on minimal known-good WAT
(documented inline in the static-array comments) and inlined to
avoid a wabt/wat-runtime dependency at test time.
The Coding Guidelines CI check uses `clang-format-14` and flagged
the line break I chose in the previous "preserve i16 intermediate"
commit. Newer clang-format-22 happens to accept both shapes;
clang-format-14 prefers the cast-then-paren-group form:
result.i32x4[lane] =
(int32)((uint32)ext_sum
+ (uint32)v3.i32x4[lane]);
Functionally identical. No behaviour change.
Two more relaxed-SIMD boundary tests in the unit suite, both
exercising implementation-defined behaviors that the dot-product
regression-tests already established for this PR but that weren't
yet covered for these ops:
1. `q15mulr_int16_min_squared_either_sat_or_wrap`: the
INT16_MIN * INT16_MIN case. Spec relaxes the result of
`sat_s((a*b + 0x4000) >> 15)` so an implementation may pick
either the saturated value (0x7fff, as NEON SQRDMULH produces)
or the wrapped value (0x8000, as x86 PMULHRSW produces). Test
uses *membership* (either of the two allowed values) rather
than exact equality, so a future switch to wrap doesn't break
the test.
2. `madd_inf_times_zero_propagates_nan` — adversarial input for
the fused/unfused FMA path (`f32x4.relaxed_madd`). IEEE 754
§7.2 makes `Inf * 0` an invalid multiply that produces NaN
regardless of the subsequent add, so both `fma(Inf, 0, c)`
and unfused `Inf * 0 + c` produce *some* NaN — but the
specific NaN bit pattern is impl-defined. Test checks each
lane against the IEEE-754 NaN predicate (exp == 0xff and
fraction != 0) rather than an exact bit pattern.
Locally exercised via `iwasm -f`:
q15mulr result: 0x7fff (saturate, current SIMDe lowering)
madd_inf_times_zero result: 0x7fc00000 per lane (canonical f32 NaN)
Both fit the spec-allowed sets the tests describe; the membership
assertions confirm without overfitting to the specific bit
pattern.
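The two membership checks can be written as small predicates. This is a sketch in plain C rather than the gtest code from the PR, and all three function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IEEE 754 binary32 NaN: exponent all ones and a non-zero fraction.
 * Accepts any payload and either sign. */
static bool f32_bits_is_nan(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xff;
    uint32_t fraction = bits & 0x7fffff;

    return exponent == 0xff && fraction != 0;
}

/* INT16_MIN * INT16_MIN under relaxed_q15mulr_s: either the saturated
 * or the wrapped lane value is acceptable. */
static bool q15mulr_min_squared_allowed(uint16_t lane)
{
    return lane == 0x7fff || lane == 0x8000;
}

/* Check all four f32 lanes of a madd(Inf, 0, c) result. */
static void check_madd_inf_times_zero(const uint32_t lanes[4])
{
    for (int i = 0; i < 4; i++)
        assert(f32_bits_is_nan(lanes[i]));
}
```

Testing the bit fields directly, rather than comparing against `0x7fc00000`, keeps the check valid for hosts that return a NaN with a different payload or sign.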
Implements the 20 relaxed-SIMD sub-opcodes (0x100..0x113) in the fast-interp
`HANDLE_OP(WASM_OP_SIMD_PREFIX)` switch and adds a `WAMR_BUILD_RELAXED_SIMD` cmake
flag (default off, opt-in). Currently those sub-opcodes hit the
`"unsupported SIMD opcode"` arm at `wasm_interp_fast.c:7474`. Hand-built
implementations for the three ops SIMDe doesn't ship
(`relaxed_q15mulr_s` + the two `relaxed_dot_i8x16_i7x16*` variants); the rest
route through `simde/wasm/relaxed-simd.h`.

Why we built this: we're replacing WasmEdge with WAMR fast-interp as the wasm
runtime in a pure-interpreter App-Store-eligible app, and the audio DSP
path (a modified version of xmrsplayer) uses
`f32x4.relaxed_madd` to reach the interpreter-only performance that we need.
Without this, fast-interp fails at load on every simd128 workload that we
introduced to reduce opcode dispatch pressure/overhead in interpreters.

Test coverage: three layers, 174 conformance checks:
* `tests/unit/relaxed-simd/` (load + invoke + boundary regressions).
* … `Config::relaxed_simd_deterministic(true)` mode in our benchmark repo. The
  diff-fuzz layer caught a spec-violating impl of
  `i32x4.relaxed_dot_i8x16_i7x16_add_s` before submission: an off-by-i16-truncation
  that produced lane values outside the spec-allowed set. The
  upstream spec testsuite did not catch it: every existing assertion stays
  within the i16 pair-sum range. Fix is in this PR; corresponding spec-test
  addition at WebAssembly/relaxed-simd#164.
* … `WebAssembly/relaxed-simd` spec-testsuite assertions wired up
  through fast-interp with `(either …)` membership semantics.

Cross-microarch benchmarks (M4 Lion P / Sawtooth E / A14 Icestorm / A12 Tempest /
S8 Watch SE2) at
https://github.com/rebeckerspecialties/wasm-benchmark/blob/claude/relaxed-simd-diff-fuzz/README.md#cross-runtime-results-across-apple-silicon-e-cores .
ASan + UBSan + fuzzing are part of my local dev loop to find corner cases.
Companion PR: legacy exception support #4949